Here is our work for this dataset named "Seoul Bike Sharing Demand".
This dataset contains the hourly count of public bikes rented in the Seoul Bike Sharing System, together with the corresponding weather data and holiday information.
Rental bikes are currently introduced in many urban cities to enhance mobility comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Providing the city with a stable supply of rental bikes eventually becomes a major concern, and the crucial part is predicting the bike count required at each hour.
Thus we will explore this dataset in order to show the link between the number of bikes rented and the other variables present in the dataset. We will then build a machine learning model to predict, approximately, the number of bikes that could be rented under specific conditions.
You can find the dataset here: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand
# Data manipulation libraries
import pandas as pd
import numpy as np
# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
# Interactive visualisation libraries
from ipywidgets import widgets
from ipywidgets import interact
df = pd.read_csv("SeoulBikeData.csv", encoding='latin1')
df.head()
df.info()
Every column is in the correct format except the Date column; we'll convert it later.
df.describe()
df.isna().sum()
No information is missing.
print(df["Date"].nunique(), "days in the year, which is correct. No days missing.")
Looking for outliers:
df[(df['Rented Bike Count']==0)][["Rented Bike Count","Functioning Day"]]
We see that a Rented Bike Count of 0 always corresponds to a non-functioning day.
Maybe by looking at every non-functioning day we can get more information:
df[df["Functioning Day"]=='No'].equals(df[(df['Rented Bike Count']==0)])
In fact, 13 days in the year are (at least partly) non-functioning, and they account for all the rows without any bike rented. So we'll treat them next.
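The `.equals` comparison used above is strict; a toy illustration on hypothetical frames (not the real data):

```python
import pandas as pd

# DataFrame.equals is True only when shape, index, values and dtypes all match,
# which makes it a strict way to check that two subsets contain the same rows
a = pd.DataFrame({"count": [0, 0], "day": ["No", "No"]})
b = pd.DataFrame({"count": [0, 0], "day": ["No", "No"]})
c = pd.DataFrame({"count": [0, 1], "day": ["No", "Yes"]})
print(a.equals(b), a.equals(c))  # True False
```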
print("Number of days within the year:", len(df.groupby("Date")))
df[df["Functioning Day"]=='No'].groupby("Date").count()
We keep only functioning days. Since the count is always 0 on non-functioning days, the column carries no information afterwards, so we drop it.
df = df[df["Functioning Day"]!='No']
df.drop(columns=["Functioning Day"],inplace=True)
df.head()
print("Number of days remaining in the year:", len(df.groupby("Date")))
We have 353 days left, which corresponds to 365-12. There is no mistake: if you look back at the non-functioning days, you'll notice that on 06/10/2018 only 7 hours of that day are non-functioning.
Now we want to convert the date into the correct format:
df['Date'] = pd.to_datetime(df['Date'],format="%d/%m/%Y")
df.info()
Adding a month column in order to analyse it more closely later:
df["Month"] = df["Date"].apply(lambda x: x.strftime("%B"))
df.tail()
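As a side note, the same month column could also be built with the `dt` accessor instead of `strftime`; a small sketch on hypothetical dates:

```python
import pandas as pd

# Series.dt.month_name() returns the full English month name, like strftime("%B")
dates = pd.to_datetime(pd.Series(["01/12/2017", "15/06/2018"]), format="%d/%m/%Y")
print(list(dates.dt.month_name()))  # ['December', 'June']
```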
We already know what to expect from standard columns such as date, hour and seasons. However, some columns may not vary much and are worth examining separately, so let's look at them more closely:
fig, ax = plt.subplots(figsize=(20,20),nrows=2,ncols=2)
df["Snowfall (cm)"].plot.hist(ax=ax[0,0],grid=True, bins=20, rwidth=0.9, color='#b9d9fa')
df["Rainfall(mm)"].plot.hist(ax=ax[0,1],grid=True,bins=20, rwidth=0.9, color='blue')
df["Visibility (10m)"].plot.hist(ax=ax[1,0],grid=True,bins=20, rwidth=0.9, color='grey')
df["Solar Radiation (MJ/m2)"].plot.hist(ax=ax[1,1],grid=True,bins=20, rwidth=0.9,xlabel="Solar Radiation (MJ/m2)",
color='red')
ax[0,0].set_title("Frequency of Snow fall in cm over the year")
ax[0,0].set_xlabel("Snowfall (cm)")
ax[0,1].set_title("Frequency of Rainfall in mm over the year")
ax[0,1].set_xlabel("Rainfall(mm)")
ax[1,0].set_title("Frequency of Visibility in 10m over the year")
ax[1,0].set_xlabel("Visibility (10m)")
ax[1,1].set_title("Frequency of Solar Radiation in MJ/m2 over the year")
ax[1,1].set_xlabel("Solar Radiation (MJ/m2)")
fig.suptitle('Histograms of different columns', fontsize=20,y=0.92);
Snow and rain values don't vary much, so let's create two new columns converting rain and snow into categorical variables.
day = df.groupby("Date").mean(numeric_only=True)  # numeric_only avoids errors on the text columns in recent pandas
print("Not snowy:", day[day["Snowfall (cm)"]==0].shape[0], "days")
print("Not rainy:", day[day["Rainfall(mm)"]==0].shape[0], "days")
df["Snow"] = (df["Snowfall (cm)"]!=0)
df["Rain"] = df["Rainfall(mm)"].apply(lambda x: "No rain"
if x==0 else ("light rain" if x<3
else ("Medium rain" if x<8
else "Strong rain")))
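The nested lambda above can also be written with `np.select`, which evaluates the conditions in order just like the chained ternaries; a sketch on hypothetical rainfall values:

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall values (mm), one per category
rainfall = pd.Series([0.0, 1.5, 5.0, 12.0])

# Conditions are checked in order, so "light rain" only applies to 0 < x < 3, etc.
conditions = [rainfall == 0, rainfall < 3, rainfall < 8]
choices = ["No rain", "light rain", "Medium rain"]
rain = pd.Series(np.select(conditions, choices, default="Strong rain"))
print(list(rain))  # ['No rain', 'light rain', 'Medium rain', 'Strong rain']
```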
Now let's check the correlation between all variables:
corr = df.corr(numeric_only=True)  # numeric_only avoids errors on the text columns in recent pandas
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(10, 10))
    ax = sns.heatmap(abs(corr), mask=mask, vmax=1, square=True, cmap="Reds")
    ax.set_title("Correlation matrix of the Data set")
corr_matrix = df.corr(numeric_only=True)[['Rented Bike Count']].sort_values(by=['Rented Bike Count'], ascending=False).drop(['Rented Bike Count'])
corr_matrix.style.background_gradient(cmap = 'coolwarm').format(precision=2)
For our regression model we decide to drop the dew point, as it is too correlated with temperature, and temperature is better correlated with the rented bike count. We also notice an improvement in the correlation between the snow information and the rented bike count now that it is a boolean (Snow vs Snowfall (cm)).
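The redundancy argument can be checked numerically: two nearly collinear predictors have |r| close to 1, so keeping both adds little information. A sketch on synthetic stand-ins for temperature and dew point:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: dew point tracks temperature up to an offset and small noise
rng = np.random.default_rng(0)
temperature = pd.Series(rng.normal(15, 10, 500))
dew_point = temperature - 5 + rng.normal(0, 1, 500)

# A Pearson correlation close to 1 signals that one of the two can be dropped
r = temperature.corr(dew_point)
print(abs(r) > 0.9)
```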
fig = px.scatter(df.sort_values(by="Date"),x="Date",y="Rented Bike Count",color="Seasons",
title="Scatterplot of Rented Bike Count by every day (including every hour)")
fig.show()
We can clearly see that winter stands out from the other seasons: the values are massed together between 0 and 500 bikes per hour, which is pretty low compared to the other seasons.
We can also see much more volatility as we get closer to summer; values are really dispersed and can reach 3500 bikes per hour.
temp = df.groupby(["Holiday","Seasons"])["Rented Bike Count"].mean().reset_index()
fig = px.bar(temp,x="Holiday", y="Rented Bike Count", color="Seasons",
title="Barplot of the mean of rented bike count whether it is holiday or not, separated by seasons")
fig.show()
We decided to plot the mean because the number of holiday days is really low compared to working days.
We see that holidays don't have much influence on the mean number of rented bikes, even if the number rented during holidays is slightly lower in each season.
fig = px.pie(df.groupby("Seasons")["Rented Bike Count"].sum().reset_index(),values="Rented Bike Count",
names='Seasons',title="Pie Chart representing the proportion of total rented bike by seasons")
fig.show()
As we expected from the previous graphs, winter doesn't even represent 1/10 of the total bikes rented, and summer is the dominant season with more than 1/3 of the total.
temp1 = (df[df["Holiday"]=="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]=="No Holiday"]["Rented Bike Count"])).reset_index()
temp2 = (df[df["Holiday"]!="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]!="No Holiday"]["Rented Bike Count"])).reset_index()
temp = pd.concat([temp1, temp2])  # DataFrame.append was removed in pandas 2.0
temp.rename(columns={'Rented Bike Count': '% of rented bike count'}, inplace=True)
fig = px.bar(temp, x="Hour", y="% of rented bike count", color = "Holiday",
title='Proportion of rented bikes per hour, taking into account whether it is a holiday or not',
color_discrete_sequence=['indianred', 'lightsalmon'])
fig.update_layout(barmode='group')
fig.show()
The holiday bars add up to 100%, and so do the non-holiday bars.
This graph lets us understand the trend underlying holidays. Since there are about 10x more non-holiday rows, we couldn't see the impact before; normalising the values reveals the trend.
Conclusion: both categories follow the same stable pattern, with a drop during the night and a peak during the afternoon. However, on working days, 8h and 18h are much more dominant. This is explained by the beginning and end of the day for most workers, who are travelling to work or back home, which of course isn't the case during holidays.
temp = df.groupby(["Rain","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Rain', title='Line plot of the mean of rented bikes per hour, taking into account the rain')
fig.show()
It doesn't rain much over the year, so we use the mean of rented bikes to interpret the trend of this column.
We can clearly see that rain affects the number of rented bikes: when it rains, people don't rent bikes and probably use public transport. However, we can still distinguish the 8h and 18h peaks when it rains.
temp = df.groupby(["Seasons","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Seasons', title='Line plot of the mean of rented bikes per hour, taking into account seasons')
fig.show()
Same as before, we clearly see the peaks at 8h and 18h every day. Every season seems to follow the same pattern except winter. At night, however, more bikes are rented in summer, which may be due to the higher night temperatures and the sun setting later.
# temperature
temp = df.groupby(["Temperature(°C)","Seasons"])["Rented Bike Count"].sum().reset_index()
px.scatter(data_frame = temp
,x = 'Temperature(°C)'
,y = 'Rented Bike Count',color="Seasons",title='Scatter plot of the total number of bikes rented by temperature and season')
We can see that temperature is strongly correlated with rented bikes: as it rises toward 20°C, the number of bikes rented generally increases. Once it surpasses 30°C the count drops drastically, as it is perhaps too hot for people to go biking.
In winter, temperatures are low, so the rented bike count is low; the other seasons lead to the same conclusion as before.
temp = df.groupby("Visibility (10m)")["Rented Bike Count"].mean().reset_index()
fig = px.scatter(temp, x="Visibility (10m)", y="Rented Bike Count",
hover_name="Rented Bike Count", size_max=60, trendline="ols", trendline_color_override="red", title='Scatter plot of the rented bike mean by visibility, with ordinary least squares trend line')
fig.show()
It's not really helpful: visibility doesn't seem to be an important variable.
textbox = widgets.Dropdown(
description='Column: ',
value="Temperature(°C)",
options=["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]
)
X = df.groupby("Temperature(°C)")["Rented Bike Count"].mean().reset_index()["Temperature(°C)"]
Y = df.groupby("Temperature(°C)")["Rented Bike Count"].mean().reset_index()["Rented Bike Count"]
#Assign an empty figure widget with two traces
trace = px.scatter(x=X, y=Y,title="Rented bike mean, by temperature",labels={'x':"Temperature(°C)", 'y':'mean of bikes'})
g = go.FigureWidget(data=trace,
layout=go.Layout(
title=dict(
text='scatter')))
def validate():
    return textbox.value in ["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]
def response(change):
    if validate():
        x1 = df.groupby(textbox.value)["Rented Bike Count"].mean().reset_index()[textbox.value]
        y1 = df.groupby(textbox.value)["Rented Bike Count"].mean().reset_index()["Rented Bike Count"]
        with g.batch_update():
            g.data[0].x = x1
            g.data[0].y = y1
            g.layout.title = "Rented bike mean, by " + str(textbox.value)
            g.layout.xaxis.title = textbox.value
            g.layout.yaxis.title = 'mean of Rented bike'
textbox.observe(response, names="value")
container2 = widgets.HBox([textbox])
widgets.VBox([container2,g])
These graphs can help us visualize correlations for variables we haven't analysed yet.
@interact
def view_image(
    col=widgets.Dropdown(
        description="Month #1 :", value="July", options=df["Month"].unique()
    ),
    filtercol=widgets.Dropdown(
        description="Month #2 :", value="August", options=df["Month"].unique()
    ),
):
    temp1 = (df[df["Month"]==col].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp2 = (df[df["Month"]==filtercol].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp = pd.concat([temp1, temp2])  # DataFrame.append was removed in pandas 2.0
    newTitle = "Bar plot of the total number of bikes rented by hour between "+col+" and "+filtercol
    fig = px.bar(temp, x="Hour", y="Rented Bike Count", color="Month", title=newTitle).update_layout(barmode="group")
    go.FigureWidget(fig.to_dict()).show()
This graph is really helpful when we want to compare two months side by side.
The goal of this project is to predict an output when we give a model some parameters. In our case, we want to predict how many bikes will likely be rented on given parameters (Hour, season...).
This is why we use machine learning : we will try different models that will predict the output with a certain accuracy. In the end, we aim to seek the best machine learning model that fits our dataset and that gives us the most accurate prediction. We will save it to a Pickle file in order to use it in our flask API.
Here are the different steps :
We used this format in order to show the results :
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
# models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RANSACRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
X = df[['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)',
        'Solar Radiation (MJ/m2)','Seasons','Holiday','Month','Snow','Rain']].copy()  # copy to avoid SettingWithCopyWarning below
y = df['Rented Bike Count']
X.head()
In order to train the models, we need to dummify the categorical predictors:
X["Holiday"] = X["Holiday"].map( {'No Holiday': 0, 'Holiday': 1} ).astype(int)
X["Rain"] = X["Rain"].map( {'No rain': 0, 'light rain': 1, 'Medium rain': 2, 'Strong rain': 3} ).astype(int)  # keys must match the labels created earlier
#X["Month"] = X["Month"].map({"January":1, "February":2, "March":3, "April":4, "May":5,
# "June":6, "July":7, "August":8, "September":9, "October":10,
# "November":11, "December":12}).astype(int)
X["Snow"] = X["Snow"].map( {False: 0, True: 1} ).astype(int)
#We dummify Visibility as we concluded during analysis, same for solar radiation
X['Visibility (10m)']=X['Visibility (10m)'].apply(lambda x: 1 if x==2000 else 0)
X['Solar Radiation (MJ/m2)']=X['Solar Radiation (MJ/m2)'].apply(lambda x:1 if x>=0.5 else 0)
seasons = pd.get_dummies(X.Seasons)
months = pd.get_dummies(X.Month)
X = pd.merge(X, seasons, left_index=True, right_index=True)
X.drop(columns="Seasons",inplace=True)
X = pd.merge(X, months, left_index=True, right_index=True)
X.drop(columns="Month",inplace=True)
X.info()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)
def cross_val(model):
    # cross_val_score uses the estimator's default scorer (R² for regressors)
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()
def print_evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')
def evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square
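A quick sanity check of these metrics on hand-picked numbers (each prediction off by exactly 10 bikes):

```python
import numpy as np
from sklearn import metrics

true = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 310.0])  # every prediction off by 10

mae = metrics.mean_absolute_error(true, predicted)           # 10.0
rmse = np.sqrt(metrics.mean_squared_error(true, predicted))  # 10.0
r2 = metrics.r2_score(true, predicted)                       # 1 - 300/20000 = 0.985
print(mae, rmse, r2)
```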
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train);
lin_test_pred = lin_reg.predict(X_test)
lin_train_pred = lin_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lin_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lin_train_pred)
results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, lin_test_pred) , cross_val(LinearRegression())]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
rb_reg = RANSACRegressor(estimator=LinearRegression(), max_trials=100)  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.1
rb_reg.fit(X_train, y_train)
rb_test_pred = rb_reg.predict(X_test)
rb_train_pred = rb_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rb_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rb_train_pred)
results_df_2 = pd.DataFrame(data=[["Robust Regression", *evaluate(y_test, rb_test_pred) , cross_val(RANSACRegressor())]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
ridge_reg = Ridge(alpha=100, solver='cholesky', tol=0.0001, random_state=42)
ridge_reg.fit(X_train, y_train)
ridge_test_pred = ridge_reg.predict(X_test)
ridge_train_pred = ridge_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, ridge_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, ridge_train_pred)
results_df_2 = pd.DataFrame(data=[["Ridge Regression", *evaluate(y_test, ridge_test_pred) , cross_val(Ridge())]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
lasso_reg = Lasso(alpha=0.1,
precompute=True,
# warm_start=True,
positive=True,
selection='random',
random_state=42)
lasso_reg.fit(X_train, y_train)
lasso_test_pred = lasso_reg.predict(X_test)
lasso_train_pred = lasso_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lasso_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lasso_train_pred)
results_df_2 = pd.DataFrame(data=[["Lasso Regression", *evaluate(y_test, lasso_test_pred) , cross_val(Lasso())]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
rf_test_pred = rf_reg.predict(X_test)
rf_train_pred = rf_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_train_pred)
results_df_2 = pd.DataFrame(data=[["Random Forest Regressor", *evaluate(y_test, rf_test_pred), 0]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
svm_reg = SVR(kernel='rbf', C=1000000, epsilon=0.001)
svm_reg.fit(X_train, y_train)
svm_test_pred = svm_reg.predict(X_test)
svm_train_pred = svm_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, svm_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, svm_train_pred)
results_df_2 = pd.DataFrame(data=[["SVM Regressor", *evaluate(y_test, svm_test_pred), 0]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df[['Model',"RMSE","R2 Square"]].sort_values(by='R2 Square').style.background_gradient(cmap = 'coolwarm')
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)
clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_title('RMSE by model')
sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_title('R2 Square by model');
plt.figure(figsize=(10,6))
plt.scatter(rf_train_pred,rf_train_pred - y_train,
c = 'black', marker = 'o', s = 35, alpha = 0.5,
label = 'Train data')
plt.scatter(rf_test_pred,rf_test_pred - y_test,
c = 'c', marker = 'o', s = 35, alpha = 0.7,
label = 'Test data')
plt.xlabel('Predicted values')
plt.ylabel('Tailings')
plt.legend(loc = 'upper left')
plt.hlines(y = 0, xmin = 0, xmax = 3500, lw = 2, color = 'red')
plt.title("Scatter plot of the error of predictions")
plt.show()
This plot lets us see for which kinds of values we make the most significant errors. On the train set the model is pretty stable: following the red line, errors are massed around 0 with a maximum error of about 500 bikes.
However, on the test set the error spreads much more when predicting large values; the maximum error now exceeds 1500 bikes. Again, it is pretty accurate for predicted values between 0 and 1500.
Let's check whether the observed errors are normally distributed, i.e. whether our model is correct and just needs tuning.
plt.figure(figsize=(10,6))
# histplot is axes-level, so it draws on the current figure (displot would create its own and ignore the title)
sns.histplot(y_test - rf_test_pred, kde=True)
plt.title("Distribution of the prediction error");
It looks like a normal distribution: no outliers are present, and it is centered on 0. Our model is therefore not incorrect, and we can continue with its improvement.
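The visual check can be complemented with a formal test such as the D'Agostino-Pearson normality test from `scipy.stats`; a sketch on synthetic residuals standing in for `y_test - rf_test_pred`:

```python
import numpy as np
from scipy import stats

# Synthetic residuals drawn from a normal distribution (the shape we hope to see)
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0, scale=220, size=1000)

# A large p-value gives no evidence against normality (it does not prove it)
stat, p = stats.normaltest(residuals)
print(round(float(p), 3))
```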
n_estimators = [1000]
max_features = [1.0]  # 1.0 (all features) is the equivalent of the removed 'auto' option for regressors
max_depth = [int(x) for x in np.linspace(10,100,10)]
min_samples_split = [2,5,15,30]
min_samples_leaf = [1,2,5,20]
random_grid = {
'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf
}
# First create the base model to tune
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,scoring='neg_mean_squared_error',
n_iter = 5, cv = 3, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train);
rf_random_test_pred = rf_random.predict(X_test)
rf_random_train_pred = rf_random.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_random_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_random_train_pred)
rf2_reg = RandomForestRegressor(n_estimators=1000)
rf2_reg.fit(X_train, y_train)
rf2_test_pred = rf2_reg.predict(X_test)
rf2_train_pred = rf2_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf2_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf2_train_pred)
rf3_reg = RandomForestRegressor(n_estimators=1500)
rf3_reg.fit(X_train, y_train)
rf3_test_pred = rf3_reg.predict(X_test)
rf3_train_pred = rf3_reg.predict(X_train)
print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf3_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf3_train_pred)
results_df = pd.DataFrame(data=[["Random Forest", *evaluate(y_test, rf_test_pred)]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df_2 = pd.DataFrame(data=[["Random Forest tuned randomly", *evaluate(y_test, rf_random_test_pred)]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1000)", *evaluate(y_test, rf2_test_pred)]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1500)", *evaluate(y_test, rf3_test_pred)]],
columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)
clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_ylim(200,240)
ax[0].set_title('RMSE by model')
sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_ylim(0.6,0.9)
ax[1].set_title('R2 Square by model');
final_model = rf3_reg # rf with ntree = 1500
After analyzing the data and removing the irrelevant columns, we tried different kinds of machine learning models to see which one suits our dataset best. In the end, we can clearly see that the Random Forest model gives the best results (R²: 0.87 / RMSE: 222) on the test set, which is why we decided to use this model in our Flask API.
After trying to improve this model, we concluded that the number of estimators was the only parameter worth tuning; increasing it up to 1500 helps, but beyond that level the improvement is no longer significant (R² gains on the order of 1e-6).
Our model is efficient when predicting values under 1500; above that, it doesn't match reality well. This may result from there being too few rows with high target values (above 1500) to train the model correctly for them.
Using the MAE metric, our predictions are correct to within roughly ±139 bikes.
In fact, looking at the dataset, 75% of the rows correspond to target values (number of bikes rented) under 1085, which is quite narrow.
We trained our model on a dataset excluding values above 1500 and the MAE was halved without losing much R². This shows that with a dataset containing more diverse target values, we would surely get better results.
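The subset experiment mentioned above boils down to a simple filter on the target before rebuilding X and y; a sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical rows; the real experiment applies the same filter to df
toy = pd.DataFrame({"Rented Bike Count": [120, 800, 1600, 2400],
                    "Hour": [3, 8, 18, 18]})

# Keep only rows whose target stays below 1500, then redo the feature
# engineering, the train/test split and the training on this subset
subset = toy[toy["Rented Bike Count"] < 1500]
print(len(subset))  # 2
```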
import pickle
with open("model.pkl","wb") as f:
    pickle.dump(final_model, f)
An example of usage:
with open("model.pkl","rb") as f:
    clf2 = pickle.load(f)
print("Predicted:", round(clf2.predict(X_test)[20]), "vs Actual:", y_test.iloc[20])